# AI Safety
2 posts
· 15 min read
The Interpretability Illusion: Can We Ever Truly See Inside an AI's Mind?
Mechanistic interpretability was supposed to crack open AI's black box. But what if the AI learns to hide? A deep dive into the arms race between researchers trying to understand AI and models that might learn to deceive their observers.
#AI Deep Dives · #AI Safety · #Interpretability · #Alignment
· 13 min read
The AI Observer Effect: When Testing AI Changes AI
If measuring an AI changes its behavior, how can we ever verify AI safety? A deep dive into situational awareness, alignment faking, and the Heisenberg-like uncertainty of AI performance.
#AI Deep Dives · #AI Safety · #Alignment · #Observer Effect